Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes to the dataset have been made, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | age | 45222 non-null | int64 |
| 1 | workclass | 45222 non-null | object |
| 2 | education_level | 45222 non-null | object |
| 3 | education-num | 45222 non-null | float64 |
| 4 | marital-status | 45222 non-null | object |
| 5 | occupation | 45222 non-null | object |
| 6 | relationship | 45222 non-null | object |
| 7 | race | 45222 non-null | object |
| 8 | sex | 45222 non-null | object |
| 9 | capital-gain | 45222 non-null | float64 |
| 10 | capital-loss | 45222 non-null | float64 |
| 11 | hours-per-week | 45222 non-null | float64 |
| 12 | native-country | 45222 non-null | object |
| 13 | income | 45222 non-null | object |

dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 45222 | NaN | NaN | NaN | 38.5479 | 13.2179 | 17 | 28 | 37 | 47 | 90 |
| workclass | 45222 | 7 | Private | 33307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| education_level | 45222 | 16 | HS-grad | 14783 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| education-num | 45222 | NaN | NaN | NaN | 10.1185 | 2.55288 | 1 | 9 | 10 | 13 | 16 |
| marital-status | 45222 | 7 | Married-civ-spouse | 21055 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| occupation | 45222 | 14 | Craft-repair | 6020 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| relationship | 45222 | 6 | Husband | 18666 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| race | 45222 | 5 | White | 38903 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| sex | 45222 | 2 | Male | 30527 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| capital-gain | 45222 | NaN | NaN | NaN | 1101.43 | 7506.43 | 0 | 0 | 0 | 0 | 99999 |
| capital-loss | 45222 | NaN | NaN | NaN | 88.5954 | 404.956 | 0 | 0 | 0 | 0 | 4356 |
| hours-per-week | 45222 | NaN | NaN | NaN | 40.938 | 12.0075 | 1 | 40 | 40 | 45 | 99 |
| native-country | 45222 | 41 | United-States | 41292 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| income | 45222 | 2 | <=50K | 34014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Number of observations: 45222
Number of people with income > 50k: 11208
Number of people with income <= 50k: 34014
Percent of people with income > 50k: 24.78
Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.
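Factor names containing special characters, such as -, can cause issues (for example, a hyphen is read as a minus sign in attribute access and in model formula strings), so renaming them first may prove helpful.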
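As a minimal sketch of that cleaning step, the hyphens in the column names can be replaced with underscores. The two-row frame below is a hypothetical stand-in that borrows a few of the original column names; it is not the notebook's actual loading code.

```python
import pandas as pd

# Hypothetical stand-in for the census data, using three of the original column names.
df = pd.DataFrame({
    "age": [39, 50],
    "education-num": [13.0, 13.0],
    "capital-gain": [2174.0, 0.0],
})

# Replace hyphens with underscores so every name is a valid Python identifier.
df.columns = df.columns.str.replace("-", "_", regex=False)

print(list(df.columns))
```

After the rename, a column such as `capital_gain` can be reached with `df.capital_gain`, whereas `df.capital-gain` would be parsed as a subtraction.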
Working with categorical variables often involves transforming strings to numerical values: frequently 0 or 1 for binomial factors, and $\{x_{0}, x_{1}, \ldots, x_{n}\} \mapsto \{0, 1, \ldots, n\}$ for multinomial factors.
These values may be ordinal (i.e. values with relationships that can be compared as a ranking, e.g. worst, better, best), or nominal (i.e. values indicate a state, e.g. blue, green, yellow).
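One way to build such mappings is sketched here, assuming pandas; it is not necessarily how the mappings in this notebook were produced. A nominal factor is coded in order of first appearance with `pd.factorize`, while an ordinal factor gets an explicit, hand-written ranking. The three-row frame and the `education_rank` dictionary are illustrative only, and the dictionary covers just a subset of the education levels.

```python
import pandas as pd

# Hypothetical three-row sample with one nominal and one ordinal factor.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female"],
    "education_level": ["Bachelors", "HS-grad", "Masters"],
})

# Nominal factor: codes follow order of first appearance (Male -> 0, Female -> 1).
codes, uniques = pd.factorize(df["sex"])
df["numeric_sex"] = codes

# Ordinal factor: an explicit ranking written by hand (illustrative subset only).
education_rank = {"Masters": 2, "Bachelors": 3, "HS-grad": 7}
df["numeric_education_level"] = df["education_level"].map(education_rank)

print(df)
```

Writing the ordinal ranking out by hand keeps the numeric order meaningful, which an order-of-appearance coding would not guarantee.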
====================================
Mapping for variable: numeric_income
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | <=50K | 0 |
| 1 | >50K | 1 |
=======================================
Mapping for variable: numeric_workclass
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | State-gov | 0 |
| 1 | Self-emp-not-inc | 1 |
| 2 | Private | 2 |
| 3 | Federal-gov | 3 |
| 4 | Local-gov | 4 |
| 5 | Self-emp-inc | 5 |
| 6 | Without-pay | 6 |
============================================
Mapping for variable: numeric_marital_status
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Never-married | 0 |
| 1 | Married-civ-spouse | 1 |
| 2 | Divorced | 2 |
| 3 | Married-spouse-absent | 3 |
| 4 | Separated | 4 |
| 5 | Married-AF-spouse | 5 |
| 6 | Widowed | 6 |
========================================
Mapping for variable: numeric_occupation
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Adm-clerical | 0 |
| 1 | Exec-managerial | 1 |
| 2 | Handlers-cleaners | 2 |
| 3 | Prof-specialty | 3 |
| 4 | Other-service | 4 |
| 5 | Sales | 5 |
| 6 | Transport-moving | 6 |
| 7 | Farming-fishing | 7 |
| 8 | Machine-op-inspct | 8 |
| 9 | Tech-support | 9 |
| 10 | Craft-repair | 10 |
| 11 | Protective-serv | 11 |
| 12 | Armed-Forces | 12 |
| 13 | Priv-house-serv | 13 |
==========================================
Mapping for variable: numeric_relationship
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Not-in-family | 0 |
| 1 | Husband | 1 |
| 2 | Wife | 2 |
| 3 | Own-child | 3 |
| 4 | Unmarried | 4 |
| 5 | Other-relative | 5 |
==================================
Mapping for variable: numeric_race
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | White | 0 |
| 1 | Black | 1 |
| 2 | Asian-Pac-Islander | 2 |
| 3 | Amer-Indian-Eskimo | 3 |
| 4 | Other | 4 |
=================================
Mapping for variable: numeric_sex
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Male | 0 |
| 1 | Female | 1 |
=============================================
Mapping for variable: numeric_education_level
| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Doctorate | 0 |
| 1 | Prof-school | 1 |
| 2 | Masters | 2 |
| 3 | Bachelors | 3 |
| 4 | Assoc-voc | 4 |
| 5 | Assoc-acdm | 5 |
| 6 | Some-college | 6 |
| 7 | HS-grad | 7 |
| 8 | 12th | 8 |
| 9 | 11th | 9 |
| 10 | 10th | 10 |
| 11 | 9th | 11 |
| 12 | 7th-8th | 12 |
| 13 | 5th-6th | 13 |
| 14 | 1st-4th | 14 |
| 15 | Preschool | 15 |
For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the rest of the data (training_features), or independent variables ($X$).
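A minimal sketch of that separation, assuming pandas and a frame whose label column is `numeric_income`. The rows and the name `training_labels` are hypothetical; only `training_features` and `numeric_income` appear in the text above.

```python
import pandas as pd

# Hypothetical three-row frame; the values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "hours_per_week": [40.0, 13.0, 40.0],
    "numeric_income": [0, 0, 1],
})

label_column = "numeric_income"

# Y: the dependent variable (label).
training_labels = df[label_column]

# X: every remaining column, i.e. the independent variables.
training_features = df.drop(columns=[label_column])

print(training_features.columns.tolist())
```

`drop` returns a new frame, so the original `df` keeps its label column and can still be used for inspection.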
The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).
To reduce this skew, a shifted logarithmic transformation, $\tilde x = \ln\left(x + 1\right)$, can be applied; the shift is needed because both features are 0 for most records and $\ln\left(0\right)$ is undefined. This transformation reduces the variance and pulls the mean closer to the center of the distribution.
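The effect of a logarithmic transformation on skew can be sketched as follows. This assumes the shifted form $\ln(1 + x)$, computed with `numpy.log1p`, so that zero values stay defined; the series below is hypothetical and only mimics a feature that is mostly zero with a few large amounts.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mostly zeros with a few large amounts, similar in shape to capital_gain.
capital_gain = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2174.0, 14084.0, 99999.0])

# log1p computes ln(1 + x), which is defined (and equal to 0) at x = 0.
log_capital_gain = np.log1p(capital_gain)

# Sample skewness before and after the transformation.
print("skew before:", capital_gain.skew())
print("skew after: ", log_capital_gain.skew())
print("variance before:", capital_gain.var())
print("variance after: ", log_capital_gain.var())
```

On this toy series both the skewness and the variance drop after the transformation, although the zeros still form a spike at the low end of the distribution.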
Why does this matter: The extreme points may affect the performance of the predictive model.
Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.
Why DOESN'T this matter: The distribution of the independent variables is not an assumption of most models. Linear regression does, however, make assumptions about the residuals: a zero conditional mean given the independent variables, $E\left(u | x\right) = 0$ where $u = Y - \hat{Y}$, and homoskedasticity (constant variance of the residuals given the independent variables). In this analysis, the dependent variable is categorical (i.e. discrete or non-continuous) and linear regression is not an appropriate model.
| | Feature | Skewness | Mean | Variance |
|---|---|---|---|---|
| 0 | Capital Loss | 4.516154 | 88.595418 | 1.639858e+05 |
| 1 | Capital Gain | 11.788611 | 1101.430344 | 5.634525e+07 |
Optimization terminated successfully.
Current function value: 0.692991
Iterations 3
Logit Regression Results
==============================================================================
Dep. Variable: numeric_income No. Observations: 45222
Model: Logit Df Residuals: 45221
Method: MLE Df Model: 0
Date: Thu, 16 Apr 2020 Pseudo R-squ.: -0.2376
Time: 22:57:36 Log-Likelihood: -31338.
converged: True LL-Null: -25322.
Covariance Type: nonrobust LLR p-value: nan
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
capital_loss 8.534e-05 2.28e-05 3.747 0.000 4.07e-05 0.000
================================================================================
| | Feature | Skewness | Mean | Variance |
|---|---|---|---|---|
| 0 | Capital Loss | 4.516154 | 88.595418 | 163985.81018 |
| 1 | Capital Gain | 11.788611 | 1101.430344 | 56345246.60482 |
| 2 | Log Capital Loss | 4.271053 | 0.355489 | 2.54688 |
| 3 | Log Capital Gain | 3.082284 | 0.740759 | 6.08362 |
Optimization terminated successfully.
Current function value: 0.693117
Iterations 3
Transformed model
Logit Regression Results
==============================================================================
Dep. Variable: numeric_income No. Observations: 45222
Model: Logit Df Residuals: 45221
Method: MLE Df Model: 0
Date: Thu, 16 Apr 2020 Pseudo R-squ.: -0.2378
Time: 22:57:53 Log-Likelihood: -31344.
converged: True LL-Null: -25322.
Covariance Type: nonrobust LLR p-value: nan
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
capital_loss 0.0095 0.006 1.642 0.101 -0.002 0.021
================================================================================
Original model
Logit Regression Results
==============================================================================
Dep. Variable: numeric_income No. Observations: 45222
Model: Logit Df Residuals: 45221
Method: MLE Df Model: 0
Date: Thu, 16 Apr 2020 Pseudo R-squ.: -0.2376
Time: 22:57:53 Log-Likelihood: -31338.
converged: True LL-Null: -25322.
Covariance Type: nonrobust LLR p-value: nan
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
capital_loss 8.534e-05 2.28e-05 3.747 0.000 4.07e-05 0.000
================================================================================
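The terms normalization and standardization are frequently used interchangeably, but they serve two different scaling purposes: normalization rescales values to a fixed range, typically [0, 1], while standardization rescales them to zero mean and unit variance.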
Earlier, capital_gain and capital_loss were transformed logarithmically, reducing their skew, and affecting the model's predictive power (i.e. ability to discern the relationship between the dependent and independent variables).
Another method of influencing the model's predictive power is normalization of the numerical independent variables, which puts each feature on a comparable scale so that it is treated equally in the model.
However, after scaling is applied, observing the data in its raw form will no longer have the same meaning as before.
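The difference between the two kinds of scaling can be sketched with plain pandas arithmetic. This is an illustration under stated assumptions rather than the notebook's own scaling code: the four ages are hypothetical, with 17 and 90 chosen to match the minimum and maximum reported for age earlier.

```python
import pandas as pd

# Hypothetical ages; 17 and 90 match the min and max reported for age.
age = pd.Series([17.0, 39.0, 50.0, 90.0])

# Normalization (min-max): rescale to the range [0, 1].
age_normalized = (age - age.min()) / (age.max() - age.min())

# Standardization (z-score): rescale to zero mean and unit standard deviation.
age_standardized = (age - age.mean()) / age.std()

print(age_normalized.tolist())
print(age_standardized.tolist())
```

Under min-max scaling an age of 39 becomes (39 - 17) / (90 - 17) ≈ 0.3014, which is no longer readable as an age in years.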
| | age | workclass | education_level | education_num | marital_status | occupation | relationship | race | sex | capital_gain | ... | native_country | numeric_income | numeric_workclass | numeric_marital_status | numeric_occupation | numeric_relationship | numeric_race | numeric_sex | numeric_native_country | numeric_education_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.301370 | State-gov | Bachelors | 0.800000 | Never-married | Adm-clerical | Not-in-family | White | Male | 0.667492 | ... | United-States | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 1 | 0.452055 | Self-emp-not-inc | Bachelors | 0.800000 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.000000 | ... | United-States | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 3 |
| 2 | 0.287671 | Private | HS-grad | 0.533333 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.000000 | ... | United-States | 0 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 7 |
| 3 | 0.493151 | Private | 11th | 0.400000 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.000000 | ... | United-States | 0 | 2 | 1 | 2 | 1 | 1 | 0 | 0 | 9 |
| 4 | 0.150685 | Private | Bachelors | 0.800000 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.000000 | ... | Cuba | 0 | 2 | 1 | 3 | 2 | 1 | 1 | 1 | 3 |
5 rows × 22 columns